I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

I hereby disclose that I used online LLM resources (ChatGPT and UM Maizey GPT for HS650) to help debug code, generate ideas, understand concepts, and improve the quality of the paper.

1 Abstract

Telemedicine platforms have transformed healthcare delivery, generating vast amounts of free-text data from patient-physician interactions. This project leveraged 497,975 conversations from an Indonesian telemedicine platform to improve care efficiency through several strategies: simplifying pregnancy-related answers, prioritizing HIV-related consultations, and generating synthetic conversations. Pregnancy-related consultations were segmented into three themes (“Pregnancy Progression and Maternal Health,” “Early Pregnancy Signs and Hormonal Concerns,” and “Fertility, Conception, and Sexual Health”) to streamline support for expectant mothers. HIV-related consultations were clustered into six themes, enabling the development of a predictive triage model with 95.72% accuracy to prioritize high-risk cases. Additionally, synthetic conversations in Indonesian and English were generated using OpenAI’s GPT-4o-mini API to create a contextual dataset for training language models, enhancing their relevance and accuracy. These findings demonstrate the potential of combining clustering, predictive modeling, and natural language generation to optimize telemedicine services and support efficient healthcare delivery.

2 Introduction

Online Health Consultation (OHC), also referred to as telemedicine, has improved healthcare delivery by enabling seamless communication between patients and physicians. As a result, it has generated an immense volume of free-text data capturing the essence of patient-physician conversations. Building on this large volume of conversation data, this project explores ways to improve telemedicine care and make healthcare delivery more efficient.

This project utilizes a dataset of 497,975 patient-physician conversations from an Indonesian OHC platform, spanning from December 8, 2014, to February 28, 2021. This dataset, publicly available on Mendeley Data, provides an opportunity to explore various dimensions of care management.

2.1 Dataset Overview

# Load the CSV file
set.seed(213) # Set seed for reproducibility
df_notes_all <- read.csv("~/Documents/UMich/Health Informatics/2024 Fall/HS650/final-project/datasets/indo-health-conv/Indo-Online Health Consultation-Multilabel-Raw.csv")
df_notes_all <- df_notes_all[!duplicated(df_notes_all),] # Remove duplicate
df_notes <- df_notes_all

# View data
datatable(head(df_notes, 1))
# Investigate dataset distribution
summary(df_notes)
##     title             question         question_date         answer         
##  Length:360530      Length:360530      Length:360530      Length:360530     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  answer_date           topics           topic_set             risk          
##  Length:360530      Length:360530      Length:360530      Length:360530     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       year      time_to_answer     
##  Min.   :2014   Min.   :   0.0000  
##  1st Qu.:2016   1st Qu.:   0.0000  
##  Median :2017   Median :   0.0000  
##  Mean   :2018   Mean   :   0.4164  
##  3rd Qu.:2019   3rd Qu.:   0.0000  
##  Max.   :2021   Max.   :1392.0000

The dataset comprises 360,530 unique rows (after duplicate removal) and 10 columns: ‘title’, ‘question’, ‘question_date’, ‘answer’, ‘answer_date’, ‘topics’, ‘topic_set’, ‘risk’, ‘year’, and ‘time_to_answer’. The ‘risk’ column contains the risk level of the consultation, while ‘topics’ and ‘topic_set’ contain the topics discussed in the conversation. The ‘question_date’ and ‘answer_date’ columns represent the dates of the question and answer, respectively.

2.2 Dataset EDA

The goal of this exploratory data analysis (EDA) is to understand the dataset’s characteristics, identify patterns, and explore potential insights. The analysis includes checking for missing values, examining the distribution of risk levels, and visualizing the length of questions and answers. Based on these EDA results, we will determine which hypotheses to test and which methods to use to address them, with the ultimate aim of improving telemedicine consultation services.

# Check for missing values
missing_values <- df_notes %>%
  summarise_all(~sum(is.na(.)))

missing_values
##   title question question_date answer answer_date topics topic_set risk year
## 1     0        0             0      0           0      0         0    0    0
##   time_to_answer
## 1              0

There are no missing values in the dataset, ensuring that all columns are complete and ready for analysis.

# Pie chart of 'risk' distribution
risk_distribution <- df_notes %>%
  group_by(risk) %>%
  summarise(count = n()) %>%
  mutate(
    proportion = count / sum(count),
    label = paste0(risk, "\n", round(proportion * 100, 1), "% (", count, ")")
  )

ggplot(risk_distribution, aes(x = "", y = count, fill = risk)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  labs(
    title = "Distribution of Risk Levels",
    x = NULL,
    y = NULL,
    fill = "Risk Level"
  ) +
  theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(hjust = 0.5),
    panel.grid = element_blank()
  )

Most of the dataset consists of low-risk consultations. This distribution is essential for understanding the volume of consultations and the potential impact of different risk levels on the platform.

# Histogram of question length with outliers removed
df_notes$question_length <- nchar(df_notes$question)

question_mean_length <- mean(df_notes$question_length, na.rm = TRUE)
question_sd_length <- sd(df_notes$question_length, na.rm = TRUE)

df_notes_question_filtered <- df_notes %>%
  filter(question_length >= (question_mean_length - 3 * question_sd_length) & question_length <= (question_mean_length + 3 * question_sd_length)) # Remove outliers

ggplot(df_notes_question_filtered, aes(x = question_length)) + 
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Question Length (Outliers Removed)", 
    x = "Question Length", 
    y = "Frequency"
  ) +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(aes(xintercept = mean(question_length, na.rm = TRUE)), color = "red", linetype = "dashed", linewidth = 1) # Add average line

Patient questions vary widely in length, with the average falling between 300 and 500 characters (excluding outliers). This gives insight into the complexity of the questions and the amount of information patients provide to the physician during a consultation.

# Histogram of answer length with outlier removed
df_notes$answer_length <- nchar(df_notes$answer)
answer_mean_length <- mean(df_notes$answer_length, na.rm = TRUE)
answer_sd_length <- sd(df_notes$answer_length, na.rm = TRUE)

df_notes_answer_filtered <- df_notes %>%
  filter(answer_length >= (answer_mean_length - 3 * answer_sd_length) & answer_length <= (answer_mean_length + 3 * answer_sd_length)) # Remove outliers

ggplot(df_notes_answer_filtered, aes(x = answer_length)) + 
  geom_histogram(binwidth = 100, fill = "skyblue", color = "black") +
  labs(
    title = "Histogram of Answer Length (Outliers Removed)", 
    x = "Answer Length", 
    y = "Frequency"
  ) +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(aes(xintercept = mean(answer_length, na.rm = TRUE)), color = "red", linetype = "dashed", linewidth = 1)

Meanwhile, physician answers average around 1,500 characters, indicating that physicians provide detailed responses despite the limited information supplied by patients.

# Number of distinct 'topic_set' values in 'low risk'
distinct_low_risk_topics_count <- df_notes %>%
  filter(risk == "low") %>% # Filter first, then group
  group_by(topic_set) %>%
  summarise(num_of_answer = n(), distinct_topics = n_distinct(topics)) %>% 
  arrange(desc(num_of_answer))

datatable(head(distinct_low_risk_topics_count), caption = "Low-Risk Topics and Their Distribution")

The table above shows topic sets from the low-risk category, such as kehamilan (pregnancy), which has 12,865 answers and 456 distinct topics. With so many different topics in one set, it becomes hard to identify the main themes of the conversations. Grouping these topics into broader categories can help simplify the analysis and make it easier to understand the general themes.

# Number of distinct 'topic_set' values in 'high risk'
distinct_high_risk_topics_count <- df_notes %>%
  filter(risk == "high") %>% # Filter first, then group
  group_by(topic_set) %>%
  summarise(num_of_answer = n(), distinct_topics = n_distinct(topics)) %>% 
  arrange(desc(num_of_answer))

datatable(head(distinct_high_risk_topics_count), caption = "High-Risk Topics and Their Distribution")

Similar to the low-risk categories previously, the table shows topic sets from higher-risk categories such as tuberkulosis (tuberculosis) with 3,933 answers and 501 distinct topics. These topic sets also contain a large number of distinct topics, making it difficult to identify the main themes of the conversations. Simplifying these topic sets by grouping them into broader categories would make it easier to analyze and prioritize care strategies.

# Get top 5 most frequent 'topic_set' for each 'risk' levels
top_5_topic_set_high_risk <- distinct_high_risk_topics_count %>% slice_head(n = 5)
top_5_topic_set_low_risk <- distinct_low_risk_topics_count %>% slice_head(n = 5)

# Date data parsing
df_notes$parsed_date <- as.Date(df_notes$question_date, format = "%d %B %Y, %H:%M")
df_notes$month_year <- as.Date(format(df_notes$parsed_date, "%Y-%m-01"), format = "%Y-%m-%d")

# Month-Year trend of consultation per 'risk'
monthly_risk_trend <- df_notes %>%
  group_by(month_year, risk) %>%
  summarise(consultation_count = n(), .groups = "drop") %>%
  arrange(desc(consultation_count), as.Date(month_year, format = "%Y-%m"), risk)

ggplot(monthly_risk_trend, aes(x = as.Date(month_year, format = "%Y-%m"), y = consultation_count, color = risk, group = risk)) +
  geom_line(linewidth = 1) + # 'size' is deprecated for lines in ggplot2 >= 3.4
  labs(
    title = "Monthly Trend of Consultations per Risk Level",
    x = "Month-Year",
    y = "Number of Consultations",
    color = "Risk Level"
  ) +
  theme_minimal()

The monthly trend of consultations by risk level is shown above. Low-risk consultations dominate, peaking significantly around 2018, followed by a gradual decline. In contrast, high-risk consultations maintain a steady but relatively low level throughout the period. This indicates a higher volume of low-risk cases being addressed on the platform, emphasizing the need for scalable solutions for low-risk categories while ensuring focused care for high-risk cases.
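A caveat on the date parsing above: as.Date() with the %B specifier depends on the system locale, so Indonesian month names (e.g., Desember, Agustus) would silently yield NA on an English-locale machine. A minimal, locale-independent sketch (assuming date strings shaped like "08 Desember 2014, 10:00"; this is an illustration, not the project code):

```r
# Parse "DD <Indonesian month> YYYY, HH:MM" without relying on locale
parse_indo_date <- function(x) {
  months <- c(januari = 1, februari = 2, maret = 3, april = 4, mei = 5,
              juni = 6, juli = 7, agustus = 8, september = 9,
              oktober = 10, november = 11, desember = 12)
  parts <- strsplit(sub(",.*$", "", tolower(x)), " ") # drop the time part
  vapply(parts, function(p) {
    sprintf("%s-%02d-%02d", p[3], months[[p[2]]], as.integer(p[1]))
  }, character(1))
}

parse_indo_date("08 Desember 2014, 10:00") # "2014-12-08"
```

The returned strings are in ISO format and can be passed directly to as.Date().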

# Month-Year trend of top 5 'topic_set' in 'low risk' consultations
monthly_top_5_topic_set_low_risk <- df_notes %>%
  filter(topic_set %in% top_5_topic_set_low_risk$topic_set) %>%
  group_by(month_year, topic_set) %>%
  summarise(consultation_count = n(), .groups = "drop") %>%
  arrange(as.Date(month_year, format = "%Y-%m"), topic_set)

ggplot(monthly_top_5_topic_set_low_risk, aes(x = as.Date(month_year, format = "%Y-%m"), y = consultation_count, color = topic_set, group = topic_set)) +
  geom_line() +
  labs(
    title = "Monthly Trend of Top 5 Topic Sets in Low Risk Consultations",
    x = "Month - Year",
    y = "Number of Consultations",
    color = "Topic Set"
  ) +
  theme_minimal()

The line chart above shows monthly trends for low-risk topics like kehamilan (pregnancy), bayi (infants), intim-wanita (women’s intimacy), menstruasi (menstruation), and obat (medications). Pregnancy had the highest activity, followed by menstruation, peaking in 2018, while other topics show steady but lower engagement over time.

# Month-Year trend of top 5 'topic_set' in 'high risk' consultations
monthly_top_5_topic_set_high_risk <- df_notes %>%
  filter(topic_set %in% top_5_topic_set_high_risk$topic_set) %>%
  group_by(month_year, topic_set) %>%
  summarise(consultation_count = n(), .groups = "drop") %>%
  arrange(as.Date(month_year, format = "%Y-%m"), topic_set)

ggplot(monthly_top_5_topic_set_high_risk, aes(x = as.Date(month_year, format = "%Y-%m"), y = consultation_count, color = topic_set, group = topic_set)) + 
  geom_line() +
  labs(
    title = "Monthly Trend of Top 5 Topic Sets in High Risk Consultations",
    x = "Month - Year",
    y = "Number of Consultations",
    color = "Topic Set"
  ) +
  theme_minimal()

Next, the chart shows monthly trends for high-risk topics such as tuberculosis, HIV, diarrhea, diabetes, and stroke. Tuberculosis saw the highest activity, peaking in 2018 before declining, while other topics had steady but lower activity. Diabetes and stroke have relatively lower numbers of consultations, indicating less frequent but ongoing discussions.

3 Hypotheses & Methods

After examining the content of these conversations, this project will focus on three different hypotheses that ultimately aim to improve telemedicine consultation services. The methods used to address these hypotheses include clustering analysis, predictive analysis, and synthetic data generation using OpenAI’s GPT-4o-mini API. Detailed explanations of each problem and the methods used are provided below:

  1. Simplifying Pregnancy-Related Answers:

Problem: Pregnancy-related consultations span a wide range of topics, making it challenging to identify general themes that could improve care efficiency.

Hypothesis: Grouping diverse pregnancy-related queries into general categories will streamline information and improve support for expectant mothers.

Method: Clustering analysis using k-means with TF-IDF vectorization to segment the pregnancy topic set and then visualize it using t-SNE.
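The clustering pipeline for this hypothesis can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the exact project code: `docs` is assumed to be a character vector of pregnancy-related consultation texts, and the `text2vec` and `Rtsne` packages are one possible implementation choice.

```r
library(text2vec) # TF-IDF vectorization
library(Rtsne)    # t-SNE projection

set.seed(213)
# 1. Tokenize and build a document-term matrix
it         <- itoken(tolower(docs), tokenizer = word_tokenizer)
vectorizer <- vocab_vectorizer(create_vocabulary(it))
dtm        <- create_dtm(it, vectorizer)

# 2. Re-weight terms by TF-IDF
dtm_tfidf <- fit_transform(dtm, TfIdf$new())

# 3. Segment with k-means into 3 clusters
km <- kmeans(as.matrix(dtm_tfidf), centers = 3, nstart = 25)

# 4. Project to 2-D with t-SNE and color by cluster
tsne <- Rtsne(as.matrix(dtm_tfidf), perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = km$cluster, pch = 19,
     main = "t-SNE of Pregnancy-Related Consultations")
```

For large corpora, running t-SNE on a truncated SVD of the TF-IDF matrix (rather than the full sparse matrix) is a common speed-up.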

  2. Triage Prioritization for HIV-Related Questions:

Problem: High-risk consultations, such as HIV-related questions, require immediate attention and prioritization to ensure timely and effective care. However, the current system does not have a mechanism to prioritize these cases.

Hypothesis: Segmenting HIV-related questions into priority levels (High, Medium, and Low) will enhance response times and ensure that urgent cases receive the attention they need.

Method: A two-step approach:

• Clustering analysis: Segment the ‘hiv’ topic set based on question characteristics utilizing k-means clustering with TF-IDF vectorization and then visualize it using UMAP.

• Predictive analysis: Predict triage priorities by building a model with linear discriminant analysis (LDA), using SMOTE to balance the dataset, and then evaluate the model with a confusion matrix and ROC curve.
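The predictive step can be sketched as follows. This is a hedged illustration, not the project code: `features` is assumed to be a data frame of numeric predictors and `priority` its triage label, and `smotefamily::SMOTE` and `MASS::lda` are assumed package choices. SMOTE is shown here for a binary High-vs-rest split; full multiclass balancing would repeat the oversampling per minority class.

```r
library(MASS)        # lda()
library(smotefamily) # SMOTE()
library(caret)       # confusionMatrix()
library(pROC)        # roc()

set.seed(213)
# 1. Oversample the minority class with SMOTE (binary illustration)
balanced     <- SMOTE(features, target = as.numeric(priority == "High"))
train        <- balanced$data          # SMOTE returns features + a 'class' column
train$class  <- factor(train$class)

# 2. Fit the LDA triage model
model <- lda(class ~ ., data = train)

# 3. Evaluate with a confusion matrix and ROC curve
pred <- predict(model, train)
print(confusionMatrix(pred$class, train$class))
plot(roc(response = train$class, predictor = pred$posterior[, 2]))
```

In practice the evaluation should use a held-out test split rather than the training data shown here.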

  3. Conversation Generator:

Problem: With the increasing use of LLMs, Indonesian-specific conversational data is needed to train models that are relevant to the Indonesian context. However, there is a lack of Indonesian conversational data available for this purpose.

Hypothesis: Synthetic conversations generated with an LLM (OpenAI’s GPT-4o-mini API) can simulate patient-physician interactions and yield a question-answering dataset in the Indonesian language.

Method: Utilize OpenAI’s GPT-4o-mini API to generate synthetic conversations based on patient questions and physician answers from the dataset.

4 Results

4.3 Conversation Generator

# Set OpenAI API key
api_key <- config::get("openai_api_key")
Sys.setenv(OPENAI_API_KEY = api_key)
# Sys.setenv(OPENAI_API_KEY = "") # or put GPT API key here


# Function to generate conversation using OpenAI GPT API
generate_conversation <- function(patient_question, doctor_answer, language) {
  # Define the prompt for the conversation generation
  prompt <- paste(
    "I want you to role-play as a doctor and a patient.",
    "The main goal of this role-play is to create a conversation between the doctor and the patient that feels as natural as possible.",
    "Here is a question from the patient to the doctor as a precursor to the conversation:",
    patient_question,
    "Here is the doctor's response to the patient as a precursor to the conversation:",
    doctor_answer,
    "Create a conversation with 8 to 10 exchanges, adjusting for the complexity of the case.",
    "Remove any name and other personally identifiable information from the conversation, make the generated conversation anonymous.",
    "Make the conversation in:",
    language, 
    "DO NOT ADD OTHER INFORMATION THAT IS NOT ON THE PRECURSOR QUESTION AND ANSWER!",
    "REMOVE ANY NAME AND OTHER PERSONALLY IDENTIFIABLE INFORMATION FROM THE CONVERSATION!"
  )
  
  answer <- create_chat_completion(
    model = "gpt-4o-mini",
    temperature = 0,
    messages = list(
      list(
        "role" = "system",
        "content" = "You are a professional doctor simulating a conversation with a patient."
      ),
      list(
        "role" = "user",
        "content" = prompt
      )
    )
  )

  conversation <- answer$choices$message.content
  
  return(conversation)
}

The function generate_conversation generates synthetic conversations using OpenAI’s GPT-4o-mini API. To make the generated conversation as contextual as possible, prompt engineering is applied: the prompt specifies the role-play scenario, the patient’s question, the doctor’s answer, and the language of the conversation. The function takes a patient’s question and a doctor’s answer as input and generates a conversation between the two. It also takes a parameter specifying the language of the generated conversation; in this case we generate conversations in both Indonesian and English. As the dataset contains patient and doctor names, the prompt explicitly instructs the model to remove any names and other personally identifiable information. This ensures that the generated conversation is anonymous, contains no sensitive information, and mitigates potential biases if the synthetic dataset is later used to train a model.
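Because each API call can fail transiently (rate limits, network timeouts), wrapping generate_conversation in a retry helper is prudent when processing many rows. A minimal sketch, where the 3-attempt limit and 5-second back-off are arbitrary assumptions:

```r
# Retry wrapper: re-attempts a failed API call with a fixed back-off
generate_with_retry <- function(question, answer, language,
                                max_attempts = 3, wait_sec = 5) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(
      generate_conversation(question, answer, language),
      error = function(e) {
        message(sprintf("Attempt %d failed: %s", attempt, conditionMessage(e)))
        NULL
      }
    )
    if (!is.null(result)) return(result)
    Sys.sleep(wait_sec) # back off before retrying
  }
  NA_character_ # give up after max_attempts
}
```

The wrapper is a drop-in replacement for generate_conversation in the mapply calls below; rows that exhaust all attempts are marked NA rather than aborting the whole batch.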

# Select sample data with only 'question' and 'answer' columns
set.seed(123)
df_notes_sample <- df_notes %>%
  filter(answer_length > 4000 | question_length > 4000) %>%
  dplyr::select(question, answer) %>% 
  sample_n(3) # Sample 3 rows (reproducible via the seed set above; sample_n has no 'seed' argument)

# Apply the function to each question and answer in sample data
df_notes_sample$generated_conversation_id <- mapply(
  generate_conversation,
  df_notes_sample$question,
  df_notes_sample$answer,
  'indonesian'
)

df_notes_sample$generated_conversation_en <- mapply(
  generate_conversation,
  df_notes_sample$question,
  df_notes_sample$answer, 
  'english'
)

# View the data frame with generated conversations
datatable(head(df_notes_sample, 1), caption = "Generated Conversations")

As a sample, we gathered three rows of data with long questions and answers. The function generate_conversation is then applied to each question and answer in the sample data to generate synthetic conversations in Indonesian and English. The generated conversations are stored in the generated_conversation_id and generated_conversation_en columns, respectively. The generated conversations are then displayed in the table above.

Overall, we obtain synthetic conversations in Indonesian and English with names and other personally identifiable information removed to ensure anonymity. These synthetic conversations have the potential to serve as an Indonesian-language question-answering dataset for training an LLM to be relevant to the Indonesian context. This can help improve the quality of the telemedicine platform by providing relevant and accurate responses to patient questions.
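If the generated conversations are to be used for model training, a common interchange format is JSON Lines (one JSON object per line). A sketch using the jsonlite package; the instruction/output field names are an assumption modeled on typical instruction-tuning datasets, not a requirement of any particular framework:

```r
library(jsonlite)

# Write one JSON object per row: the question as the instruction,
# the generated conversation as the output
export_jsonl <- function(df, path, conversation_col) {
  lines <- vapply(seq_len(nrow(df)), function(i) {
    as.character(toJSON(list(instruction = df$question[i],
                             output      = df[[conversation_col]][i]),
                        auto_unbox = TRUE))
  }, character(1))
  writeLines(lines, path)
}

export_jsonl(df_notes_sample, "synthetic_conversations_id.jsonl",
             "generated_conversation_id")
```

Each line is then independently parseable, which makes the file easy to stream into standard fine-tuning pipelines.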

5 Discussion

Upon analyzing the content of the telemedicine consultations, we identified three key hypotheses that could improve the efficiency and effectiveness of the telemedicine platform. The hypotheses are related to simplifying pregnancy-related answers, triage prioritization for HIV-related questions, and generating synthetic conversations to train the LLM model.

The first hypothesis on simplifying pregnancy-related answers was successfully addressed by segmenting the pregnancy-related consultations into three distinct clusters. The clusters were named based on the main themes of the conversations, such as “Pregnancy Progression and Maternal Health,” “Early Pregnancy Signs and Hormonal Concerns,” and “Fertility, Conception, and Sexual Health.” The t-SNE visualization showed that the clusters were well-separated, indicating that the segmentation was successful. This will help streamline information and improve support for expectant mothers by categorizing diverse pregnancy-related queries into general categories. The drawback of this analysis is that the clustering is done based on the answers given by the physician, which might not be the best representation of the patient’s question. Future work could involve clustering based on both question and answer to get a better representation of the conversation.

The second hypothesis on triage prioritization for HIV-related questions was successfully addressed by segmenting the HIV-related consultations into six distinct clusters. The clusters were named based on the main themes of the conversations: “Sexual Relationships and Protection,” “Unprotected Sexual Activity Risks,” “Testing and Diagnosis,” “HIV/AIDS Awareness and Symptoms,” “Potential Exposure to Viruses,” and “Treatment and Recovery Support.” The UMAP visualization showed that the clusters were well separated, indicating that the segmentation was successful. Triage priorities were assigned to each cluster based on the urgency and severity of the themes identified. The distribution of triage priorities showed that the majority of consultations were classified as Medium priority, followed by High and Low priorities. For the predictive analysis, we created a model to predict the triage priorities of the HIV-related consultations, which demonstrated high accuracy, sensitivity, and specificity. This will help improve the efficiency of the telemedicine platform by ensuring that high-risk consultations are addressed promptly. A drawback of this method relates to how the triage priority is derived: it is based on the most frequent keywords from the clustering analysis, which may not best represent the urgency of a question. Future work could involve a more in-depth analysis of each question to determine its urgency.

The last hypothesis on generating synthetic conversations was successfully addressed by creating synthetic conversations in Indonesian and English using OpenAI’s GPT-4o-mini API. The synthetic conversations were generated based on patient questions and physician answers from the dataset and stored in the generated_conversation_id and generated_conversation_en columns, respectively. These synthetic conversations can serve as an Indonesian-language question-answering dataset for training an LLM to be relevant to the Indonesian context, helping create a contextual LLM for the telemedicine platform that can provide relevant and accurate responses to patient questions. A drawback of this method is that a synthetic conversation may not be as accurate as a real one, since it is generated from a single question-answer pair with limited context and information. Future work could involve more comprehensive data preprocessing and prompt engineering to improve the quality of the synthetic conversations.

6 Conclusion

In conclusion, the hypotheses on simplifying pregnancy-related answers, triage prioritization for HIV-related questions, and generating synthetic conversations were successfully addressed through data analysis, machine learning techniques, and text generation using LLMs. The segmentation of pregnancy-related and HIV-related consultations into distinct clusters will help streamline information and improve support for expectant mothers and HIV patients. The predictive model for triage prioritization of HIV-related consultations will help ensure that high-risk consultations are addressed promptly. The synthetic conversations generated in Indonesian and English will help create a contextual LLM model for the telemedicine platform. These findings will help improve the efficiency and effectiveness of the telemedicine platform by providing relevant and accurate data classification and prediction. Future work could involve more in-depth analysis of the questions and answers to improve the clustering and predictive models, as well as more comprehensive data preprocessing and prompt engineering to improve the quality of the synthetic conversations.

7 References

  1. Juanita, Safitri; Purwitasari, Diana; Purnama, I Ketut Eddy; Purnomo, Mauridhi (2023), “Doctor’s Answer Text Dataset in Indonesian Contains Information on Medical Interview Patterns”, Mendeley Data, V1, doi: 10.17632/p8d5bynh3m.1
  2. S. Juanita, D. Purwitasari, I. Ketut Eddy Purnama, A. Famasya Abdillah and M. H. Purnomo, “Topic Modeling for Online Health Consultation on Low-Risk Diseases,” 2024 International Seminar on Intelligent Technology and Its Applications (ISITIA), Mataram, Indonesia, 2024, pp. 196-201, doi: 10.1109/ISITIA63062.2024.10667722.
  3. Wang, Junda; Yao, Zonghai; Yang, Zhichao; Zhou, Huixue; Li, Rumeng; Wang, Xun; Xu, Yucheng; Yu, Hong (2024), “NoteChat: A Dataset of Synthetic Patient-Physician Conversations Conditioned on Clinical Notes”, pp. 15183-15201, doi: 10.18653/v1/2024.findings-acl.901.
  4. OpenAI API Documentation. https://platform.openai.com/docs/overview